Data Validation in Excel and R

Author

Eric R. Scott

Published

August 24, 2022

Overview

There are many opportunities for human or instrument error to affect data. Ideally, you want to find those errors and fix them early and often! This workshop introduces some tools in Excel and R to avoid making mistakes in data entry and data collection, and to detect the ones you inadvertently make.

Learning Objectives

  • Understand best practices for entering data and fixing errors in data

  • Use Excel data validation tools to prevent data entry errors

  • Compare data entered by two people in R to check for data entry mistakes

  • Explore data summaries to check for errors

  • Get the gist of how you can use the pointblank package to perform data validation checks

We are only interested in fixing verifiable mistakes! It is generally not appropriate to remove or edit outliers—extreme values that may or may not be accurate.

Software needed

You’ll need access to Excel, R, RStudio, and the following R packages:

library(tidyverse)
library(visdat)
library(pointblank)
library(readxl)
library(skimr)

If you can, install the development version of pointblank using this R code:

if(!require("remotes")) {
  install.packages("remotes")
} 
remotes::install_github("rich-iannone/pointblank")

Avoiding mistakes in data entry

Set up validation tools in your data entry spreadsheet to stop data entry errors in their tracks!

[gif: a cat typing furiously on a laptop]

Data Validation Tools in Excel

[screenshot: Data Validation dialog box in Excel; the Settings tab is shown with the “Allow:” dropdown menu]

  • Select a column (or cells) and choose Data > Validation… from the menu

  • Use “list” to restrict entries to specific values for categorical data

  • Use “whole number” for count data

  • Validation can also be set up after data entry: highlight invalid data with “Circle Invalid Data” from the toolbar

Watch out for Excel autocorrect!

[Nature headline: "Autocorrect errors in Excel still creating genomics headache. Despite geneticists being warned about spreadsheet problems, 30% of published papers contain mangled gene names in supplementary data."]

[meme: confused anime guy with butterfly; the guy has a Microsoft Excel logo on his face, the butterfly is "any data at all", and the caption reads "is this a date?"]

To stop Excel from converting entries to dates:

  1. Explicitly set all column types to numeric, text, date, etc.

  2. Make sure no columns are set to “general”
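
The same idea applies when reading spreadsheets into R: readxl's read_excel() accepts a col_types argument, so you can declare every column's type instead of letting it be guessed. A minimal sketch, assuming a hypothetical genes.xlsx with three columns:

```r
library(readxl)

# declaring col_types disables type guessing, so gene names like
# "SEPT2" stay text instead of being silently parsed as dates
genes <- read_excel(
  "genes.xlsx",  # hypothetical file
  col_types = c("text", "numeric", "date")
)
```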

Double-entry Method

  • Two people enter the same data, then compare the two versions programmatically.

  • In the data folder, there are two versions of a dataset—one entered by Eric and one entered by Jessica.

eric <- read_excel("data/data_eric.xlsx")
jessica <- read_excel("data/data_jessica.xlsx")

Compare visually with visdat

We can compare them a couple of ways. First, we can compare them visually using the visdat package; this only works if the two datasets have the same dimensions.

vis_compare(eric, jessica)
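
Because vis_compare() errors if the dimensions differ, a quick guard before the comparison makes the failure mode explicit. A minimal sketch in base R:

```r
# vis_compare() requires both data frames to have identical dimensions;
# stopifnot() halts with an informative error if they differ
stopifnot(identical(dim(eric), dim(jessica)))
```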

Compare with dplyr::anti_join()

First add row numbers to make it easier to find mistakes in Excel.

# add row numbers that match Excel (the header is row 1)
eric    <- eric    |> mutate(row = 2:(n()+1), .before = plot)
jessica <- jessica |> mutate(row = 2:(n()+1), .before = plot)

anti_join() returns the rows of its first data frame that have no exact match in the second data frame (by default it joins on all shared columns).
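
To see that behavior at a small scale, here is a toy example with made-up values:

```r
library(dplyr)

a <- tibble(id = 1:3, height = c(10, 20, 30))
b <- tibble(id = 1:3, height = c(10, 25, 30))

# joins by all shared columns (id and height); only the id = 2 row of
# `a` lacks an exact match in `b`, so it is the only row returned
anti_join(a, b)
```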

#values in `eric` that are different in `jessica`
anti_join(eric, jessica)
Joining, by = c("row", "plot", "plant_id", "shts_2000", "ht_2000", "flwr_2000",
"shts_2001", "ht_2001", "flwr_2001")
# A tibble: 7 × 9
    row plot  plant_id shts_2000 ht_2000 flwr_2000 shts_2001 ht_2001 flwr_2001
  <int> <chr>    <dbl>     <dbl>   <dbl> <chr>         <dbl>   <dbl> <chr>    
1    14 A           33         1      27 <NA>              2      23 <NA>     
2    96 B          142         3      56 yse               4      52 <NA>     
3   121 B          162         2      19 <NA>              3      15 <NA>     
4   159 C          226         2      38 <NA>              1      32 <NA>     
5   167 C          235         1       7 <NA>              1       8 <NA>     
6   168 C          236         1      12 <NA>              1      13 <NA>     
7   225 E          306         1      13 <NA>             NA      NA <NA>     
#values in `jessica` that are different in `eric`
anti_join(jessica, eric)
Joining, by = c("row", "plot", "plant_id", "shts_2000", "ht_2000", "flwr_2000",
"shts_2001", "ht_2001", "flwr_2001")
# A tibble: 7 × 9
    row plot  plant_id shts_2000 ht_2000 flwr_2000 shts_2001 ht_2001 flwr_2001
  <int> <chr>    <dbl>     <dbl>   <dbl> <chr>         <dbl>   <dbl> <chr>    
1    14 A           33         1      21 <NA>              2      23 <NA>     
2    96 B          142         3      56 yes               4      52 <NA>     
3   121 C          162         2      19 <NA>              3      15 <NA>     
4   159 C          226         2      33 <NA>              1      32 <NA>     
5   167 C          233         1       8 <NA>              1       7 <NA>     
6   168 C          234         1       6 <NA>             NA      NA <NA>     
7   225 E          306         1      13 <NA>              1      12 <NA>     

Errors include:

  • row 14: messy handwriting? (21 or 27)
  • row 96: typo in flwr_2000
  • row 121: discrepancy in plot ID
  • row 159: messy handwriting? (33 or 38)
  • row 167 & 168: completely different rows
  • row 225: missing data in Eric’s version
#after fixing data-entry errors, we get `data_resolved.xlsx`
plants <- read_excel("data/data_resolved.xlsx")

Explore data summaries

  • You can’t check for errors if you don’t get to know your data!
  • Use skimr::skim() to get a nicely formatted summary
  • Look for number of unique values for categorical variables
  • Look for long tails or strange patterns in mini-histograms for numeric variables
skimr::skim(plants)
Data summary
Name plants
Number of rows 247
Number of columns 8
_______________________
Column type frequency:
character 3
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
plot 0 1.00 1 1 0 5 0
flwr_2000 243 0.02 3 3 0 1 0
flwr_2001 245 0.01 3 3 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
plant_id 0 1.00 167.79 97.25 1.0 83.5 166 247.5 337 ▇▇▇▇▇
shts_2000 26 0.89 2.06 1.07 1.0 1.0 2 3.0 5 ▇▇▃▂▁
ht_2000 25 0.90 26.95 13.85 1.1 17.0 24 34.0 91 ▆▇▃▁▁
shts_2001 26 0.89 2.10 1.11 1.0 1.0 2 3.0 6 ▇▂▁▁▁
ht_2001 26 0.89 26.88 14.47 5.0 16.0 24 35.0 80 ▇▇▃▁▁

Or get a more detailed breakdown by running skim() on a grouped data frame:

plants |> 
  group_by(plot) |> 
  skim()
Data summary
Name group_by(plants, plot)
Number of rows 247
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 5
________________________
Group variables plot

Variable type: character

skim_variable plot n_missing complete_rate min max empty n_unique whitespace
flwr_2000 A 54 0.02 3 3 0 1 0
flwr_2000 B 63 0.02 3 3 0 1 0
flwr_2000 C 65 0.02 3 3 0 1 0
flwr_2000 D 35 0.03 3 3 0 1 0
flwr_2000 E 26 0.00 NA NA 0 0 0
flwr_2001 A 55 0.00 NA NA 0 0 0
flwr_2001 B 64 0.00 NA NA 0 0 0
flwr_2001 C 66 0.00 NA NA 0 0 0
flwr_2001 D 34 0.06 3 3 0 1 0
flwr_2001 E 26 0.00 NA NA 0 0 0

Variable type: numeric

skim_variable plot n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
plant_id A 0 1.00 37.18 22.51 1.0 17.50 35.0 57.50 73 ▇▆▆▆▇
plant_id B 0 1.00 118.34 25.71 75.0 97.50 119.5 141.25 160 ▆▇▆▆▇
plant_id C 0 1.00 206.36 25.24 162.0 188.25 208.5 227.75 246 ▇▅▇▇▇
plant_id D 0 1.00 274.17 13.41 249.0 263.75 275.5 285.25 299 ▅▇▇▇▅
plant_id E 0 1.00 320.54 11.07 304.0 310.25 319.5 330.75 337 ▇▅▅▅▇
shts_2000 A 4 0.93 2.08 1.02 1.0 1.00 2.0 2.50 5 ▆▇▂▂▁
shts_2000 B 8 0.88 1.93 0.85 1.0 1.00 2.0 2.00 4 ▆▇▁▃▁
shts_2000 C 8 0.88 2.28 1.28 1.0 1.00 2.0 3.00 5 ▇▇▃▂▂
shts_2000 D 4 0.89 1.69 1.03 1.0 1.00 1.0 2.00 5 ▇▅▁▁▁
shts_2000 E 2 0.92 2.29 1.00 1.0 1.75 2.0 3.00 4 ▆▇▁▇▃
ht_2000 A 3 0.95 28.56 15.24 9.0 17.00 24.0 35.00 74 ▇▇▃▁▂
ht_2000 B 8 0.88 25.30 10.44 8.0 18.00 23.5 32.25 56 ▅▇▃▂▁
ht_2000 C 8 0.88 27.84 15.43 6.0 18.25 24.5 33.75 91 ▇▇▂▁▁
ht_2000 D 4 0.89 26.75 15.08 8.0 14.00 28.0 38.25 60 ▇▆▃▂▃
ht_2000 E 2 0.92 25.42 12.45 1.1 16.00 25.0 34.25 49 ▂▇▅▅▃
shts_2001 A 5 0.91 2.32 1.13 1.0 1.00 2.0 3.00 5 ▇▇▇▂▂
shts_2001 B 6 0.91 1.84 0.83 1.0 1.00 2.0 2.00 4 ▇▇▁▃▁
shts_2001 C 9 0.86 2.30 1.32 1.0 1.00 2.0 3.00 6 ▇▂▂▁▁
shts_2001 D 4 0.89 1.78 1.01 1.0 1.00 2.0 2.00 5 ▇▇▁▁▁
shts_2001 E 2 0.92 2.17 1.09 1.0 1.00 2.0 3.00 4 ▇▇▁▃▃
ht_2001 A 5 0.91 30.90 17.27 5.0 19.25 27.5 38.00 80 ▆▇▃▁▁
ht_2001 B 6 0.91 23.55 10.84 6.0 16.00 22.5 29.50 52 ▅▇▅▃▁
ht_2001 C 9 0.86 27.53 14.41 7.0 16.00 25.0 36.00 71 ▇▆▅▂▁
ht_2001 D 4 0.89 25.59 16.20 8.0 13.75 19.5 31.25 70 ▇▃▁▂▁
ht_2001 E 2 0.92 26.71 12.28 10.0 16.00 25.0 36.00 49 ▇▆▂▃▅

Explore data visually

  • visdat::vis_guess() can help spot inconsistencies
  • I’ll change one of the plots to a number to demonstrate
#change plot in the 10th row to "1"
plants$plot[10] <- 1
#doesn't change the type of the column
class(plants$plot)
[1] "character"
#but vis_guess() spots the mistake!
visdat::vis_guess(plants)

It also spots a decimal in the ht_2000 column (though the green line is hard to see).
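
You can also pull the offending rows out directly: since a fractional height is what vis_guess() flagged, filtering on the fractional part (x %% 1) finds it. A sketch using the column names from this dataset:

```r
# rows where ht_2000 was not recorded to a whole cm
plants |>
  filter(!is.na(ht_2000), ht_2000 %% 1 != 0)
```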

Data validation pipelines with pointblank

library(pointblank)
  • pointblank provides 6 (six!) workflows for validating data

  • The Data Quality Reporting Workflow (VALID-1) is probably most useful for this group

  • Start with a data frame, create an “agent”, tell it what to expect of your data with validation functions, and let it “interrogate” your data

  • Output is an HTML table with buttons to download CSV files of any data that didn’t pass your validations

pointblank demo

  1. Decide on “action levels”. You can set a number or fraction of failing test units as the threshold for a warning or error
al <- action_levels(warn_at = 1, stop_at = .02)
al
-- The `action_levels` settings
WARN failure threshold of 1 test units.
STOP failure threshold of 0.02 of all test units.
----
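In action_levels(), values of 1 or more are interpreted as absolute counts of failing test units, while values below 1 are interpreted as fractions of all test units, so the same helper expresses either style. A hypothetical stricter setting:

```r
# warn when 5 or more test units fail a step;
# stop when 10% or more of a step's test units fail
al_strict <- action_levels(warn_at = 5, stop_at = 0.1)
```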
  2. Create agent
agent <- 
  create_agent(
    tbl = plants, #our data example from before
    label = "plants 2000 & 2001",
    actions = al
  )
  3. Specify validation conditions

    • Basic checks on column types with col_is_*() functions
    • Check column values with col_vals_*() functions
    • Check rows (e.g. duplicate rows) with rows_*() functions
agent_informed <- 
  agent |> 
  col_is_character(columns = plot) |>  #plot should be character
  col_is_numeric(columns = c(shts_2000, shts_2001)) |> #shts should be numeric
  
  col_vals_in_set(columns = plot, set = LETTERS[1:10]) |> #plot should be a letter A-J
  col_vals_lt(  #expect shts < 5
    columns = c(shts_2000, shts_2001), 
    value =  5,
    na_pass = TRUE
  ) |> 
  
  rows_distinct(columns = vars(plant_id)) #no duplicate plant IDs

#'TODO: ^this should work as just `columns = plant_id` but doesn't due to a bug.  
#'Change if this gets fixed before workshop. 
#'https://github.com/rich-iannone/pointblank/issues/416
  4. Interrogate!
agent_informed |> interrogate()
Pointblank Validation: plants 2000 & 2001
Action levels: WARN 1 | STOP 0.02

STEP  VALIDATION        COLUMNS    VALUES                        UNITS  PASS  FAIL
   1  col_is_character  plot                                         1     1     0
   2  col_is_numeric    shts_2000                                    1     1     0
   3  col_is_numeric    shts_2001                                    1     1     0
   4  col_vals_in_set   plot       A, B, C, D, E, F, G, H, I, J    247   246     1
   5  col_vals_lt       shts_2000  5                               247   240     7
   6  col_vals_lt       shts_2001  5                               247   239     8
   7  rows_distinct     plant_id                                   247   245     2
Tip

Click the blue “CSV” buttons above to download a .csv file of just the rows that failed that particular validation.
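
If you’d rather stay in R than download CSVs, pointblank’s get_data_extracts() pulls the same failing rows out of an interrogated agent as data frames (the step number i matches the STEP column of the report):

```r
# interrogate() stores "data extracts" (the failing rows) on the agent;
# get_data_extracts() retrieves them by validation step number
agent_int <- agent_informed |> interrogate()
get_data_extracts(agent_int, i = 4)  # rows that failed the plot value check
```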

Flexible validations

If a validation function you need doesn’t exist, you can often express the check yourself with col_vals_expr().

E.g. let’s add a validation that height is measured to the nearest cm.

agent_informed <-
  agent_informed |> 
  col_vals_expr(~ ht_2000 %% 1 == 0) |> 
  col_vals_expr(~ ht_2001 %% 1 == 0) 

agent_informed |> interrogate()
Pointblank Validation: plants 2000 & 2001
Action levels: WARN 1 | STOP 0.02

STEP  VALIDATION        COLUMNS    VALUES                        UNITS  PASS  FAIL
   1  col_is_character  plot                                         1     1     0
   2  col_is_numeric    shts_2000                                    1     1     0
   3  col_is_numeric    shts_2001                                    1     1     0
   4  col_vals_in_set   plot       A, B, C, D, E, F, G, H, I, J    247   246     1
   5  col_vals_lt       shts_2000  5                               247   240     7
   6  col_vals_lt       shts_2001  5                               247   239     8
   7  rows_distinct     plant_id                                   247   245     2
   8  col_vals_expr                ht_2000 %% 1 == 0               222   221     1
   9  col_vals_expr                ht_2001 %% 1 == 0               221   221     0

Create new columns to test on the fly

“Preconditions” let you manipulate the data before a check is run, within a single validation step.

E.g. check that height doesn’t increase by more than 50 cm from 2000 to 2001

agent_informed |> 
  col_vals_lt(
    columns = ht_change, #doesn't exist yet 
    value = 50,
    na_pass = TRUE,
    # creates a new column on the fly:
    preconditions = function(df) mutate(df, ht_change = ht_2001 - ht_2000)
    ) |> 
  interrogate()
Pointblank Validation: plants 2000 & 2001
Action levels: WARN 1 | STOP 0.02

STEP  VALIDATION        COLUMNS    VALUES                        UNITS  PASS  FAIL
   1  col_is_character  plot                                         1     1     0
   2  col_is_numeric    shts_2000                                    1     1     0
   3  col_is_numeric    shts_2001                                    1     1     0
   4  col_vals_in_set   plot       A, B, C, D, E, F, G, H, I, J    247   246     1
   5  col_vals_lt       shts_2000  5                               247   240     7
   6  col_vals_lt       shts_2001  5                               247   239     8
   7  rows_distinct     plant_id                                   247   245     2
   8  col_vals_expr                ht_2000 %% 1 == 0               222   221     1
   9  col_vals_expr                ht_2001 %% 1 == 0               221   221     0
  10  col_vals_lt       ht_change  50                              247   246     1

Publishing validation reports

  • Students, faculty, and staff at University of Arizona have access to RStudio Connect which allows you to publish an RMarkdown document to the web with a single click. (Learn More)

  • Data validation can be automated in a variety of ways. If you are interested in more advanced applications of data validation for your lab, contact us!

Fixing mistakes

  • For true mistakes in data entry (paper -> spreadsheet), it’s probably OK to just edit the raw data

  • For other errors, best practice:

    • Don’t edit raw data!
    • Record all changes to raw data (e.g. by using an R script to make them)
    • Flag observations that have been changed
    • Publish raw data, cleaning steps/scripts, and “cleaned” data
  • Keep an eye out for future workshops on data wrangling/tidying/cleaning
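
As a sketch of that scripted approach, here is a hypothetical fix for the kind of typo found during double entry (the "yse" value from row 96); the raw file is never touched, and the changed observation gets a flag column:

```r
library(dplyr)

# correct a verified typo in a script instead of editing the raw file,
# and flag the observation that was changed
plants_clean <- plants |>
  mutate(
    flwr_2000_flag = flwr_2000 %in% "yse",  # TRUE only for the changed row
    flwr_2000 = if_else(flwr_2000 %in% "yse", "yes", flwr_2000)
  )
```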

Help

Feel free to drop by the CCT Data Science Team office hours, which happen every Tuesday morning. We would love to help with your questions about R, data validation, and more!

You can also make an appointment with Eric to discuss this content and get troubleshooting help.